
All Questions

3 votes
2 answers
44 views

Required background for thorough understanding of Causal ML research papers?

I'm interested in pursuing research at the intersection of causal inference and machine learning, particularly on causal discovery and causal representation learning. Through my exploration so far, I ...
Harsh Shrivastava's user avatar
2 votes
1 answer
84 views

How to deal with actions that complete in multiple steps (delayed reward) in reinforcement learning?

I have been exploring RL and using DQN to train an agent for a problem where I have two possible actions. But one of the actions is supposed to complete over multiple steps while the other one is ...
m101's user avatar
1 vote
1 answer
51 views

Can I use minimax tree search over Q-values?

I'm trying to build a chess bot, and I'm trying to figure out if I can use Q-Values in a search tree by creating new nodes according to the number of possible moves, each with the corresponding Q-...
Leonhard Piff's user avatar
2 votes
2 answers
425 views

What is the reinforcement learning reward function for reasoning in DeepSeek-R1?

DeepSeek-R1 reports having applied Group Relative Policy Optimization, in which it rewards "accuracy". How is this accuracy measured for theorem proving? A proof can be stated in myriad ...
Hans's user avatar
0 votes
0 answers
27 views

Is my MAML implementation correct?

I'm trying to implement the MAML algorithm in the reinforcement learning domain but am not achieving fast adaptation on my validation tasks. I assume that something may be wrong with my meta loss ...
Mark Taylor's user avatar
0 votes
0 answers
39 views

What’s the State of the Art in Traffic Light Control Using Reinforcement Learning? Ideas for Master’s Thesis?

I’m currently planning my Master’s thesis and I’m interested in the application of RL to traffic light control systems. I’ve come across research using different algorithms. However, I wanted to know: ...
Baki's user avatar
1 vote
1 answer
53 views

What type of noise should I use with softmax activation?

I'm implementing an RL agent that navigates a graph. I'm using a softmax activation in the final layer of the actor network to model the action probabilities. To encourage exploration during training, ...
Baki's user avatar
0 votes
1 answer
22 views

Unidentifiable flipped sign in policy gradient

Today I was building a VPG agent for a test and noticed it was getting worse, not better, over time, so I flipped the reward during the training loop and, lo and behold, it learned. So obviously I started ...
Leonhard Piff's user avatar
1 vote
1 answer
77 views

How do I correctly apply action masking during DDPG training in Python?

I'm implementing the Deep Deterministic Policy Gradient (DDPG) algorithm in PyTorch, and I'm facing issues with applying an action mask during the training process. Currently, I apply an action mask ...
Oriol Feliu's user avatar
0 votes
3 answers
56 views

Why does TD3/DDPG use −E[Q(s,π(s))] as the policy loss without causing Q-values to go to infinity?

I tried to understand why TD3/DDPG use a policy loss of −E[Q(s,π(s))], which should make the policy maximize Q-values. I expected this to push Q-values to infinity over time, as there’s no explicit ...
Omar's user avatar
3 votes
1 answer
287 views

Can two different non-optimal policies have the same value functions?

According to Sutton and Barto (second edition, page 79), policy improvement must give a better policy except when the policy is already optimal. This means that if two policies have the same value function ...
User1983's user avatar
1 vote
1 answer
135 views

Is deep learning suitable/preferable for string similarity detection and application automation? If so, which type?

Newbie here. I have developed an app that basically does the following: perform OCR, check whether words are contained in the resulting text, and then perform an action. If no words are detected from the given list, ...
zaxunobi's user avatar
0 votes
1 answer
142 views

Is reinforcement learning suitable for application automation?

I have basically automated the use of an app through OCR and computer vision. So basically, when a word or an image is detected, it will perform a certain action. When that action is ...
zaxunobi's user avatar
1 vote
0 answers
38 views

Why are two completely different algorithms being used in Deep Q-Learning?

I'm a new student in reinforcement learning. Recently, I've been studying different RL algorithms. But I'm quite surprised that there are some algorithms which are named as "same" ...
Jahid Chowdhury Choton's user avatar
1 vote
0 answers
25 views

Enhancing Generalization in DRL Agents in Static Data Environments

Context: I'm working with a deep reinforcement learning (DRL) agent in a market-like environment where its actions do not affect the environment. The environment uses historical data up to a certain ...
ElonMuskofBadIdeas's user avatar
